Introduction: Breast cancer

Materials and methods

Materials

Kaggle breast cancer data

26 explanatory variables.

  • Condition
  • Tumor type
patient_id gender education treatment_data id_healthcenter id_treatment_region hereditary_history birth_date age weight thickness_tumor marital_status marital_length pregnency_experience giving_birth age_FirstGivingBirth abortion blood taking_heartMedicine taking_blood_pressure_medicine taking_gallbladder_disease_medicine smoking alcohol breast_pain radiation_history Birth_control(Contraception) menstrual_age menopausal_age Benign_malignant_cancer condition
111036008041 0 4 2019 1.11e+09 1.11e+09 1 1989 30 69 0.90 1 0 0 0 0 0 4 0 1 1 0 0 1 1 1 1 0 1 death
111035996130 0 6 2019 1.11e+09 1.11e+09 0 1989 30 71 0.80 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 2 0 0 death
111035971333 0 5 2019 1.11e+09 1.11e+09 0 1989 30 74 0.90 1 0 0 0 0 1 4 1 1 0 0 0 1 1 0 1 0 1 death
111036018485 0 5 2019 1.11e+09 1.11e+09 1 1989 30 75 0.70 1 1 1 3 1 0 2 1 1 1 1 0 0 0 0 2 0 0 death
111035985474 0 1 2019 1.11e+09 1.11e+09 0 2009 10 70 0.25 0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0 0 0 death
111035903616 0 3 2019 1.11e+09 1.11e+09 1 1989 30 79 0.70 0 0 0 0 0 0 6 1 1 1 0 1 1 1 1 1 0 1 death
111036003507 0 4 2019 1.11e+09 1.11e+09 1 1990 29 96 0.10 0 0 0 0 0 0 4 1 1 0 0 0 1 1 0 2 0 1 death
111036026259 0 5 2019 1.11e+09 1.11e+09 0 1990 29 75 0.80 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 2 0 0 death

Process

Cleaning the data

Unrealistic age and weight proportions.

  • Column names
      • Removed special characters from variable names
  • Variable values
      • Greedy cleanup of binary variables
      • Non-conforming blood type entries were set to NA
      • Birth dates not consisting of exactly four digits were set to NA
  • Filtering out samples
      • Only women older than 20 years were included, as they are the primary risk group
      • Removed abnormal attribute combinations
  • Removing columns
      • Removed singular columns (a single value across all samples)
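
Some of the cleaning steps above can be sketched in a few lines. This is a minimal illustration only, not the code used for the analysis: the helper names (`clean_column_name`, `clean_birth_year`, `drop_singular_columns`) and the two example rows are hypothetical, and the rule of replacing special characters with an underscore is an assumption.

```python
import re

def clean_column_name(name):
    """Replace runs of non-alphanumeric characters with '_' and trim the ends."""
    return re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_")

def clean_birth_year(value):
    """Return the year as an int if it is exactly four digits, else None (NA)."""
    text = str(value).strip()
    return int(text) if re.fullmatch(r"\d{4}", text) else None

def drop_singular_columns(rows):
    """Drop columns that hold one single value across all rows (list of dicts)."""
    if not rows:
        return rows
    keep = [key for key in rows[0] if len({row[key] for row in rows}) > 1]
    return [{key: row[key] for key in keep} for row in rows]

# Hypothetical example rows using column names from the data set.
rows = [
    {"Birth_control(Contraception)": 1, "birth_date": "1989", "gender": 0},
    {"Birth_control(Contraception)": 0, "birth_date": "89", "gender": 0},
]

# 1. Column names: remove special characters.
cleaned = [{clean_column_name(k): v for k, v in row.items()} for row in rows]

# 2. Variable values: birth dates that are not four digits become None.
for row in cleaned:
    row["birth_date"] = clean_birth_year(row["birth_date"])

# 3. Removing columns: 'gender' is constant here, so it is dropped.
cleaned = drop_singular_columns(cleaned)
print(cleaned)
```

In this toy input the `gender` column is constant, so it is removed as a singular column, while the two-digit birth date becomes a missing value.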

Augmenting the data

  • Values changed
      • Categorical variables converted to factors
      • Explicit values for categorical variables
  • Adding columns
      • Age at treatment
      • Normalised numerical variables
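
The two added columns can be illustrated as follows. This is a sketch under stated assumptions: age at treatment is taken as treatment year minus birth year (consistent with the sample rows, where 2019 - 1989 = 30), and z-score scaling stands in for the normalisation, whose exact method is not stated here. The function names are hypothetical.

```python
from statistics import mean, stdev

def age_at_treatment(treatment_year, birth_year):
    """Age at treatment in years; None if either year is missing."""
    if treatment_year is None or birth_year is None:
        return None
    return treatment_year - birth_year

def z_normalise(values):
    """Scale values to mean 0 and sample standard deviation 1.

    Assumes at least two values that are not all identical.
    """
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]

# Birth years and weights taken from a few of the sample rows.
ages = [age_at_treatment(2019, year) for year in (1989, 1989, 1990, 2009)]
weights_norm = z_normalise([69, 71, 74, 75])
print(ages)
print(weights_norm)
```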

Exploratory Analysis

Distributions

MCA (multiple correspondence analysis)

MCA rotation

Model

Predicting Condition

Reduced Model

  • Education
  • Age
  • Weight
  • Tumor type
  • Hereditary history
  • Giving birth
  • Age at first giving birth
  • Radiation history
  • Menstrual age
  • Abortion
  • Breast pain
Model      Sensitivity  Specificity  Balanced accuracy
Max_pred   88%          42%          65%
Red_pred   90%          48%          69%
baseline   100%         0%           50%

Note: Positive class = Death
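
For reference, the three reported metrics can be computed from actual and predicted labels as sketched below. The function name `classification_metrics` and the non-death label `"other"` are hypothetical. The example shows why a baseline that always predicts the positive class scores 100% sensitivity, 0% specificity and 50% balanced accuracy.

```python
def classification_metrics(actual, predicted, positive="death"):
    """Return (sensitivity, specificity, balanced accuracy).

    Assumes both classes occur at least once in `actual`.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    sensitivity = tp / (tp + fn)   # share of positives correctly predicted
    specificity = tn / (tn + fp)   # share of negatives correctly predicted
    return sensitivity, specificity, (sensitivity + specificity) / 2

# Toy labels; "other" is a placeholder for the non-death outcome.
actual = ["death", "death", "death", "other"]
baseline = ["death"] * len(actual)   # always predict the positive class
baseline_metrics = classification_metrics(actual, baseline)
print(baseline_metrics)
```

Balanced accuracy is the mean of sensitivity and specificity, so it does not reward a model for simply predicting the majority class.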

Shiny app

Discussion & Conclusion

Discussion

  • Greedy cleaning approach
  • Disagreement between MCA and reduced model
  • A general set of rules for valid entries

Conclusion

  • Possible to predict both tumor type and outcome.
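  • Prediction accuracy: the reduced models reached a balanced accuracy of 69% for condition and 53% for tumor type, against a 50% baseline.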
  • Shiny app

MCA

Predicting tumor type

Reduced Model

  • Age (norm)
  • Weight (norm)
  • Hereditary history
  • Smoking
  • Radiation history
  • Menstrual age
  • Pregnancy experience
  • Abortion
  • Breast pain
Model      Sensitivity  Specificity  Balanced accuracy
Max_pred   69%          28%          48%
Red_pred   81%          24%          53%
baseline   100%         0%           50%

Note: Positive class = Malignant